Challenge 2: Data Modeling

Author

ESADE MiBA Group 10

Published

14/03/2025

Introduction

This exercise focuses on applying tidymodels to build predictive models for the Airbnb dataset from the first challenge. After data cleaning and transformation, the goal is to streamline model creation, evaluation, and interpretation using tidymodels, a modular and extensible framework that follows tidyverse principles. It provides a consistent interface for preprocessing, model specification, tuning, and evaluation, supporting various algorithms while integrating seamlessly with external libraries.

Importing Necessary R Libraries

The following R libraries are necessary for performing the data manipulation tasks:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.7     ✔ rsample      1.2.1
✔ dials        1.4.0     ✔ tune         1.3.0
✔ infer        1.0.7     ✔ workflows    1.2.0
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.3.0     ✔ yardstick    1.3.2
✔ recipes      1.1.1     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
library(data.table)

Attaching package: 'data.table'

The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose
library(corrplot)
corrplot 0.95 loaded
library(vip)        

Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi
library(scales)
library(janitor)    

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(skimr)     
library(future)
library(leaflet)
library(ggplot2)
library(htmltools)
library(htmlwidgets)
library(gt)
library(dials)



plan(multisession, workers = parallel::detectCores() - 1) # parallel backend so tuning runs resamples concurrently

# Define helper function to extract xgboost feature-importance scores from a
# fitted workflow (tree ensembles have importances rather than coefficients)
get_coefs <- function(model) {
  model_fit <- extract_fit_engine(model)
  xgboost::xgb.importance(model = model_fit) # namespace the call: xgboost is loaded by the engine but not attached
}

Customizing the Theme for Plotting

theme_custom <- theme_light() +
  theme(plot.title = element_text(face = 'bold', colour = 'black', size = 8),
        axis.text = element_text(face = 'bold', colour = 'black', size = 8),
        panel.grid = element_blank(),
        strip.background = element_blank(),
        strip.text = element_text(face = 'bold', colour = 'black', size = 8),
        legend.position = 'bottom')

Exercises

Loading the Dataset

  • We load the Airbnb dataset and clean column names to ensure consistent formatting for analysis.
airbnb_clean <- read.csv('/Users/raissaangnged/Downloads/airbnb_clean.csv') %>%
  clean_names()

Exercise 1: Split data into train and test sets

  • We split the dataset into training (75%) and test (25%) sets for model development and evaluation. The set.seed(123) ensures reproducibility.
set.seed(123) # ensures reproducibility of the random split
airbnb_split <- initial_split(airbnb_clean) # default split (75% train, 25% test)
airbnb_train <- training(airbnb_split)
airbnb_test <- testing(airbnb_split)

airbnb_split
<Training/Testing/Total>
<14874/4959/19833>

Exercise 2: Create a 10-fold cross-validation object

  • A 10-fold cross-validation structure is created using vfold_cv(airbnb_train, v = 10), so the model is repeatedly trained on nine folds and validated on the tenth. Averaging across the ten validation folds gives a more reliable estimate of out-of-sample performance than a single validation split. Stratifying on neighbourhood_group_cleansed keeps category proportions balanced across folds, and the seed is again set to 123 so the partitioning remains consistent across executions.
set.seed(123)
airbnb_folds <- vfold_cv(airbnb_train, 
                         v = 10,
                         strata = neighbourhood_group_cleansed) # stratify so neighbourhood proportions stay balanced across folds

airbnb_folds
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits               id    
   <list>               <chr> 
 1 <split [13385/1489]> Fold01
 2 <split [13385/1489]> Fold02
 3 <split [13385/1489]> Fold03
 4 <split [13385/1489]> Fold04
 5 <split [13387/1487]> Fold05
 6 <split [13387/1487]> Fold06
 7 <split [13387/1487]> Fold07
 8 <split [13387/1487]> Fold08
 9 <split [13389/1485]> Fold09
10 <split [13389/1485]> Fold10

Exercise 3: Create a recipe, define metrics, and build a tidymodels workflow with XGBoost.

  • A preprocessing recipe (airbnb_recipe) prepares the data for modeling: it drops identifier and date columns, collapses rare categorical levels, imputes missing values with the median (numeric) or mode (nominal), and one-hot encodes categorical features, since xgboost requires all input features to be numeric. It also removes zero-variance predictors and highly correlated numeric variables. A model workflow (airbnb_wf) then combines the preprocessing recipe, an xgboost regression model with tunable hyperparameters, and a predefined evaluation metric set (airbnb_metrics) to assess model performance.
# Identify column types (kept for reference; the recipe selectors below handle selection directly)
cat_cols <- airbnb_clean %>%
  select(where(~ is.character(.x) | is.factor(.x))) %>%
  names()

num_cols <- airbnb_clean %>%
  select(where(is.numeric)) %>%
  names() %>%
  setdiff(c("price"))

airbnb_recipe <- recipe(log_price ~ ., data = airbnb_train) %>%
  step_rm(matches("date"), matches("id"), host_id, first_review, last_review, price) %>%
  step_mutate_at(all_nominal_predictors(), fn = as.character) %>% 
  step_mutate_at(contains("has_"), fn = ~ as.numeric(.)) %>%
  step_mutate_at(matches("host_is_superhost|is_location_exact|instant_bookable"), fn = ~ as.numeric(.)) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)

train_recipe <- airbnb_recipe %>% prep() %>% bake(new_data = airbnb_train) # quick check of the processed training data

# Define metrics
airbnb_metrics <- metric_set(
  rmse, mae, rsq, mape
)

# Create model spec
xgb_spec <- boost_tree(
  trees = tune(),
  learn_rate = tune(),
  tree_depth = tune(),
  min_n = tune(),
  sample_size = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Create model workflow
airbnb_wf <- workflow() %>%
  add_model(xgb_spec) %>%
  add_recipe(airbnb_recipe)

print(airbnb_metrics)
A metric set, consisting of:
- `rmse()`, a numeric metric | direction: minimize
- `mae()`, a numeric metric  | direction: minimize
- `rsq()`, a numeric metric  | direction: maximize
- `mape()`, a numeric metric | direction: minimize
print(airbnb_wf)
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
10 Recipe Steps

• step_rm()
• step_mutate_at()
• step_mutate_at()
• step_mutate_at()
• step_other()
• step_impute_median()
• step_impute_mode()
• step_dummy()
• step_zv()
• step_corr()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)

Main Arguments:
  trees = tune()
  min_n = tune()
  tree_depth = tune()
  learn_rate = tune()
  sample_size = tune()

Computational engine: xgboost 
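
As a side note on step_dummy(one_hot = TRUE): one-hot encoding expands each categorical column into one indicator column per level, which is what lets xgboost consume nominal predictors. A minimal base-R illustration (model.matrix is a stand-in here, not what recipes uses internally):

```r
# Stand-in illustration of one-hot encoding using base R's model.matrix;
# recipes::step_dummy(one_hot = TRUE) produces the same shape of output.
room <- factor(c("entire home/apt", "private room", "shared room",
                 "private room"))

# "~ 0 + room" keeps all levels instead of dropping a reference level,
# which mirrors the one_hot = TRUE behaviour
one_hot <- model.matrix(~ 0 + room)

ncol(one_hot)     # 3: one indicator column per level
rowSums(one_hot)  # each row contains exactly one 1
```

With one hot encoding, no level is treated as an implicit baseline, which suits tree-based models; for linear models the reference-level encoding is usually preferred.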

Exercise 4: Tune hyperparameters using cross-validation and finalize the best model

  • A hyperparameter tuning process defines a search space over critical parameters, including tree depth, learning rate, and the number of trees, and samples ten candidate configurations with grid_random. The tuning is performed with tune_grid over the 10-fold cross-validation setup, optimizing for RMSE as the primary metric. The best combination of hyperparameters is selected using select_best(tune_results, metric = "rmse"), the workflow is finalized with finalize_workflow, and the model is then refit on the training set and evaluated on the held-out test set with last_fit.
xgb_params <- parameters(
  trees() %>% range_set(c(100, 500)),
  learn_rate() %>% range_set(log10(c(0.01, 0.3))), # learn_rate() operates on a log10 scale, so set the range in log10 units
  tree_depth() %>% range_set(c(3, 6)),
  min_n() %>% range_set(c(5, 30)),
  sample_prop() %>% range_set(c(0.5, 1.0))
)

# Generate random tuning grid
xgb_grid <- grid_random(xgb_params, size = 10)

# Tune model
tune_results <- tune_grid(
  airbnb_wf,
  resamples = airbnb_folds,
  grid = xgb_grid,
  metrics = airbnb_metrics,
  control = control_grid(save_pred = TRUE)
)

# Select best model and finalize workflow
best_params <- select_best(tune_results, metric = "rmse")
final_wf <- finalize_workflow(airbnb_wf, best_params)

cv <- fit_resamples(
  object = final_wf, 
  resamples = airbnb_folds, 
  control = control_resamples(save_pred = TRUE),  
  metrics = airbnb_metrics
)

# Collect the per-fold metric estimates from the resamples
cv_metrics <- cv %>% 
  pluck(".metrics") %>% 
  bind_rows() %>% 
  arrange(.metric)

# Refit on the training set and evaluate once on the held-out test set
final_fit <- last_fit(
  final_wf,
  split = airbnb_split,
  metrics = airbnb_metrics
)

# Print final fit 
print(final_fit)
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits               id              .metrics .notes   .predictions .workflow 
  <list>               <chr>           <list>   <list>   <list>       <list>    
1 <split [14874/4959]> train/test spl… <tibble> <tibble> <tibble>     <workflow>
# Print metrics results
print(cv_metrics) 
# A tibble: 40 × 4
   .metric .estimator .estimate .config             
   <chr>   <chr>          <dbl> <chr>               
 1 mae     standard       0.396 Preprocessor1_Model1
 2 mae     standard       0.395 Preprocessor1_Model1
 3 mae     standard       0.401 Preprocessor1_Model1
 4 mae     standard       0.396 Preprocessor1_Model1
 5 mae     standard       0.403 Preprocessor1_Model1
 6 mae     standard       0.396 Preprocessor1_Model1
 7 mae     standard       0.397 Preprocessor1_Model1
 8 mae     standard       0.417 Preprocessor1_Model1
 9 mae     standard       0.389 Preprocessor1_Model1
10 mae     standard       0.397 Preprocessor1_Model1
# ℹ 30 more rows
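
The 40 rows above are per-fold estimates; tidymodels would aggregate them with collect_metrics(cv), but the aggregation itself is just a mean and standard deviation per metric. For example, using the ten MAE values copied from the output above:

```r
# Per-fold MAE values copied from the printed cross-validation output
fold_mae <- c(0.396, 0.395, 0.401, 0.396, 0.403,
              0.396, 0.397, 0.417, 0.389, 0.397)

mean_mae <- mean(fold_mae)  # what collect_metrics() reports as `mean`
sd_mae   <- sd(fold_mae)    # fold-to-fold variability; small here, so the model is stable

round(mean_mae, 4)  # 0.3987 on the log-price scale
```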

Exercise 5: Extract final workflow and create airbnb_predictor for predictions

  • The trained workflow is extracted into airbnb_predictor, which bundles the fitted recipe and the tuned model so predictions can be made on new data without reconfiguration. This makes it straightforward to score unseen listings consistently with the trained model.
airbnb_predictor <- final_fit$.workflow[[1]]

airbnb_predictor
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
10 Recipe Steps

• step_rm()
• step_mutate_at()
• step_mutate_at()
• step_mutate_at()
• step_other()
• step_impute_median()
• step_impute_mode()
• step_dummy()
• step_zv()
• step_corr()

── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 470.8 Kb 
call:
  xgboost::xgb.train(params = list(eta = 1.09610223849259, max_depth = 3L, 
    gamma = 0, colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 19L, 
    subsample = 0.931667963974178), data = x$data, nrounds = 420L, 
    watchlist = x$watchlist, verbose = 0, nthread = 1, objective = "reg:squarederror")
params (as set within xgb.train):
  eta = "1.09610223849259", max_depth = "3", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "19", subsample = "0.931667963974178", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 102 
niter: 420
nfeatures : 102 
evaluation_log:
  iter training_rmse
 <num>         <num>
     1     0.7338121
     2     0.6155747
   ---           ---
   419     0.3113811
   420     0.3112381

Exercise 6: Predict on airbnb_clean and visualize accuracy

  • Predictions are generated for the entire dataset using predict(airbnb_predictor, new_data = airbnb_clean), and a scatter plot is created to visualize accuracy. The plot compares actual and predicted log prices, with a red dashed identity line (geom_abline(slope = 1, intercept = 0)) serving as the reference for perfect predictions. The closer the points sit to this line, the stronger the predictive performance.
# Make predictions on full dataset
predictions <- predict(airbnb_predictor, new_data = airbnb_clean)
df_with_predictions <- bind_cols(airbnb_clean, predictions)

# Model evaluation ---------------------------------------------------
# Calculate prediction errors
df_with_predictions <- df_with_predictions %>%
  mutate(error = .pred - log_price,
         abs_error = abs(error),
         pct_error = abs_error / log_price * 100)

# Create accuracy visualization
ggplot(df_with_predictions, aes(x = log_price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Actual vs Predicted Log Prices",
    x = "Actual Log Price",
    y = "Predicted Log Price"
  ) +
  theme_custom

In this exercise, we used our trained model to predict Airbnb log prices for the entire dataset and plotted the results to evaluate model performance. The x-axis shows actual log prices and the y-axis predicted log prices. The red dashed line marks perfect predictions, so ideally all points would fall on it.

Interpretation of the Results

  • The model generally follows the trend of actual prices, indicating that it captures overall price variations well.

  • There is some dispersion, especially at higher price points, suggesting that the model struggles more with expensive listings.

  • The presence of outliers indicates cases where the model significantly under or over-predicts prices.

  • The model appears to perform well for mid-range prices but shows increasing variance for high-price properties.
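
Because the target is the log of price, a residual translates into a multiplicative error in euros, which helps explain why dispersion looks worse for expensive listings. A quick sketch (assuming log_price is a natural-log transform, as the exp() back-transform used in Exercise 8 implies):

```r
# A residual e on the log scale corresponds to being off by a factor of
# exp(e) on the euro scale (natural log assumed)
log_error <- 0.4                  # roughly the cross-validated MAE
round(exp(log_error), 2)          # 1.49: typically off by ~49% in euro terms

# Concretely: a 100 EUR listing whose prediction undershoots by 0.4 log units
actual_price    <- 100
predicted_price <- exp(log(actual_price) - log_error)
round(predicted_price, 1)         # 67 EUR
```

So a constant-looking band of residuals on the log scale fans out into ever-larger euro errors as prices rise.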

Exercise 7: Plot predictions on a geospatial map of Barcelona and analyze performance

  • A geospatial analysis of prediction errors is conducted using leaflet, mapping discrepancies between actual and predicted prices across Barcelona. The visualization represents each listing as a color-coded marker, where deviations from actual values are highlighted to identify areas of higher predictive variance. Additionally, separate neighborhood-level maps provide localized insights into model performance, helping assess pricing trends and potential systemic biases in different districts.
Neighborhoods Ranked by Largest Prediction Errors

Neighbourhood Group     Mean Error   Max Error
nou barris                   0.304       3.875
sarria-sant gervasi          0.286       2.681
les corts                    0.282       2.496
sant marti                   0.280       2.558
horta-guinardo               0.276       2.621
sants-montjuic               0.275       3.266
gracia                       0.274       4.721
sant andreu                  0.273       1.771
ciutat vella                 0.272       2.787
eixample                     0.271       4.010

Error Map of sant marti

Error Map of eixample

Error Map of gracia

Error Map of horta-guinardo

Error Map of les corts

Error Map of ciutat vella

Error Map of sants-montjuic

Error Map of sarria-sant gervasi

Error Map of nou barris

Error Map of sant andreu

The Leaflet map visualizes the prediction errors of Airbnb pricing across different neighborhoods in Barcelona. The color-coded markers indicate the level of error: green for low error (<10%), orange for moderate error (10-30%), and red for high error (>30%). The overwhelming presence of green markers suggests that the model generally performs well, with most predictions closely matching actual prices. However, some scattered orange and red markers indicate areas where the model struggles, potentially due to local pricing anomalies or data limitations. The interactive legend allows for filtering by neighborhood, which can provide further insights into regional variations in prediction accuracy.

For the map, there is an interactive checklist option where users can filter by neighborhood. This allows for easy exploration of different districts without switching between multiple maps, as well as allowing users to check out multiple areas in one view.
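
The marker colour logic described above reduces to binning pct_error (from Exercise 6) into three bands. A minimal sketch with hypothetical error values, where the cut() breaks mirror the legend thresholds:

```r
# Hypothetical percentage errors; in the report these come from
# df_with_predictions$pct_error computed in Exercise 6
pct_error <- c(2.5, 8.0, 12.0, 29.9, 35.0, 60.0)

# Bands match the legend: green < 10%, orange 10-30%, red > 30%
marker_color <- cut(pct_error,
                    breaks = c(-Inf, 10, 30, Inf),
                    labels = c("green", "orange", "red"))

as.character(marker_color)
# "green" "green" "orange" "orange" "red" "red"
```

The resulting vector can be passed straight to leaflet's marker colour argument.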

Exercise 8: Select properties within a €3M budget and estimate investment recovery time

  • A real estate investment strategy is formulated by selecting properties within a €3 million budget using predicted Airbnb rental income and estimated neighborhood occupancy rates. The analysis is based on a standardized listing profile (two bedrooms, two beds, one bathroom, and perfect ratings), with airbnb_predictor estimating potential earnings. The break-even period is calculated by projecting revenue over time, allowing for a data-driven approach to property acquisition and financial return assessment.
# Source the function to generate property data
source("/Users/raissaangnged/Downloads/create_sample.R")

set.seed(123)
budget <- 3000000

# Given avg_prices dataset
avg_prices <- data.frame(
  neighbourhood = c("eixample", "ciutat vella","sant marti","sants-montjuic", "sarria-sant gervasi",
                    "nou barris","horta-guinardo", "gracia","sant andreu","les corts"),
  avg_price = c(684012, 392645,435215,299140,980439, 201074,310891,500411,288534, 779088),
  pct_year_occupation = c(0.75, 0.8, 0.6, 0.7, 0.7, 0.5, 0.55, 0.7, 0.6, 0.7)
)

investment_analysis <- data.frame()

for (neighborhood in avg_prices$neighbourhood) {
  
  # Get the average price per property (.env$ disambiguates the loop
  # variable "neighborhood" from the data frame's columns)
  property_price <- avg_prices %>%
    filter(neighbourhood == .env$neighborhood) %>%
    pull(avg_price)
  
  # Skip if price is missing; || short-circuits, so is.na() never sees a zero-length value
  if (length(property_price) == 0 || is.na(property_price)) next
  
  # Calculate the maximum number of properties we can buy
  max_properties <- floor(budget / property_price)
  
  # Create multiple property samples and average the predictions for more accuracy
  num_samples <- 10
  predicted_prices <- numeric(num_samples)
  
  for (i in 1:num_samples) {
    # Generate property listing for prediction with standardized features
    property_listing <- create_sample(  
      df = airbnb_clean,  
      df_prices = avg_prices,
      origin_sample = sample(1:nrow(airbnb_clean), size = 1, replace = FALSE), 
      neighbourhood_new = neighborhood,
      property_type_new = "apartment",
      room_type_new = "entire home/apt",
      accommodates_new = 4,
      bathrooms_new = 1,
      bedrooms_new = 2,
      beds_new = 2,
      cleaning_fee_new = 20,
      host_response_time_new = "within an hour",
      host_response_rate_new = 100,
      review_scores_rating_new = 100,
      review_scores_accuracy_new = 10,
      review_scores_cleanliness_new = 10,
      review_scores_checkin_new = 10,
      review_scores_communication_new = 10,
      review_scores_location_new = 10,
      review_scores_value_new = 10,
      minimum_nights_new = 1,
      maximum_nights_new = 30
    )
    
    # Ensure latitude and longitude columns exist in the property_listing
    if (!all(c("latitude", "longitude") %in% names(property_listing))) {
      # If missing, add them from the original sample
      origin_data <- airbnb_clean[sample(1:nrow(airbnb_clean), size = 1, replace = FALSE), ]
      if (!"latitude" %in% names(property_listing)) {
        property_listing$latitude <- origin_data$latitude
      }
      if (!"longitude" %in% names(property_listing)) {
        property_listing$longitude <- origin_data$longitude
      }
    }
    
    # Predict nightly price using airbnb_predictor and apply exponentiation
    tryCatch({
      predicted_prices[i] <- predict(airbnb_predictor, property_listing) %>%
        pull(.pred) %>%
        exp() # Convert from log back to actual prices
    }, error = function(e) {
      # If prediction fails, use a fallback method
      cat("Prediction error:", e$message, "\n")
      # Fallback: use average price from the neighborhood with a small random adjustment
      predicted_prices[i] <<- mean(df_with_predictions$price[df_with_predictions$neighbourhood_group_cleansed == neighborhood], 
                                  na.rm = TRUE) * (1 + rnorm(1, 0, 0.05))
    })
  }
  
  # Use the median prediction to reduce the impact of outliers
  predicted_price <- median(predicted_prices, na.rm = TRUE)
  
  # Get the estimated occupancy rate for the neighborhood
  occupancy_rate <- avg_prices %>%
    filter(neighbourhood == .env$neighborhood) %>%
    pull(pct_year_occupation)
  
  # Skip if occupancy rate is missing; || short-circuits on zero-length input
  if (length(occupancy_rate) == 0 || is.na(occupancy_rate)) next
  
  # Calculate expected annual revenue (assuming 365 days)
  annual_revenue_per_property <- predicted_price * occupancy_rate * 365
  
  # Calculate annual expenses (estimate 30% of revenue for maintenance, taxes, utilities)
  annual_expenses_per_property <- annual_revenue_per_property * 0.3
  
  # Calculate net annual revenue
  net_annual_revenue_per_property <- annual_revenue_per_property - annual_expenses_per_property
  
  # Total net revenue considering the number of properties
  total_net_annual_revenue <- net_annual_revenue_per_property * max_properties
  
  # Calculate payback period in years
  payback_period <- (property_price * max_properties) / total_net_annual_revenue
  
  # Calculate ROI (Return on Investment) percentage
  roi_percentage <- (net_annual_revenue_per_property / property_price) * 100
  
  # Store results
  investment_analysis <- bind_rows(investment_analysis, tibble(
    neighbourhood = neighborhood,
    property_price = property_price,
    max_properties = max_properties,
    predicted_nightly_price = predicted_price,
    occupancy_rate = occupancy_rate,
    annual_revenue_per_property = annual_revenue_per_property,
    net_annual_revenue_per_property = net_annual_revenue_per_property,
    total_investment = property_price * max_properties,
    total_net_annual_revenue = total_net_annual_revenue,
    payback_period_years = payback_period,
    roi_percentage = roi_percentage
  ))
}

# Sort by shortest payback period (best investments first)
investment_analysis <- investment_analysis %>%
  arrange(payback_period_years)

# Print the final investment strategy
print("Optimal real estate investment strategy:")
[1] "Optimal real estate investment strategy:"
print(investment_analysis)
         neighbourhood property_price max_properties predicted_nightly_price
1       sants-montjuic         299140             10               120.99179
2           nou barris         201074             14               103.60909
3          sant andreu         288534             10                85.81333
4         ciutat vella         392645              7                87.58214
5       horta-guinardo         310891              9               100.08863
6             eixample         684012              4               135.83260
7           sant marti         435215              6                99.85531
8               gracia         500411              5                98.11457
9            les corts         779088              3                98.78084
10 sarria-sant gervasi         980439              3                97.52873
   occupancy_rate annual_revenue_per_property net_annual_revenue_per_property
1            0.70                    30913.40                        21639.38
2            0.50                    18908.66                        13236.06
3            0.60                    18793.12                        13155.18
4            0.80                    25573.98                        17901.79
5            0.55                    20092.79                        14064.95
6            0.75                    37184.17                        26028.92
7            0.60                    21868.31                        15307.82
8            0.70                    25068.27                        17547.79
9            0.70                    25238.50                        17666.95
10           0.70                    24918.59                        17443.01
   total_investment total_net_annual_revenue payback_period_years
1           2991400                216393.82             13.82387
2           2815036                185304.87             15.19138
3           2885340                131551.84             21.93310
4           2748515                125312.52             21.93328
5           2798019                126584.59             22.10395
6           2736048                104115.69             26.27892
7           2611290                 91846.92             28.43089
8           2502055                 87738.96             28.51704
9           2337264                 53000.86             44.09861
10          2941317                 52329.04             56.20812
   roi_percentage
1        7.233864
2        6.582682
3        4.559318
4        4.559281
5        4.524079
6        3.805331
7        3.517301
8        3.506676
9        2.267645
10       1.779102
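
The payback and ROI figures follow directly from the table's own columns; a worked check for sants-montjuic, with the numbers copied from the printed output:

```r
# Sanity-check of the payback arithmetic for sants-montjuic,
# using figures from the investment table above
property_price       <- 299140
max_properties       <- 10
total_net_annual_rev <- 216393.82

total_investment <- property_price * max_properties       # 2,991,400 EUR
payback_years    <- total_investment / total_net_annual_rev
round(payback_years, 2)  # 13.82, matching the table

# ROI per property: net annual revenue per property over purchase price
roi_pct <- (total_net_annual_rev / max_properties) / property_price * 100
round(roi_pct, 2)        # 7.23, matching the table
```
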
# Visualize the results
library(ggplot2)

# Plot ROI by neighborhood
ggplot(investment_analysis, aes(x = reorder(neighbourhood, roi_percentage), y = roi_percentage)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Return on Investment by Neighborhood",
       x = "Neighborhood",
       y = "ROI (%)") +
  theme_minimal()

# Plot payback period by neighborhood
ggplot(investment_analysis, aes(x = reorder(neighbourhood, -payback_period_years), y = payback_period_years)) +
  geom_bar(stat = "identity", fill = "coral") +
  coord_flip() +
  labs(title = "Payback Period by Neighborhood",
       x = "Neighborhood",
       y = "Years to Recover Investment") +
  theme_minimal()

Investment Scenarios with a €3M Budget

Based on the table, we can develop two different investment strategies:

•   Risky Scenario (High Potential Returns, More Diversification)
•   Less Risky Scenario (Stable Returns, Focused Investments)

1. Risky Investment Scenario (Diversified Approach)

This approach involves spreading the investment across multiple neighborhoods, balancing both high ROI and long-term appreciation potential.

Proposed Portfolio:

risky_scenario <- data.frame(
  Neighborhood = c("Sants-Montjuïc", "Nou Barris", "Sant Andreu", "Eixample", "Ciutat Vella"),
  Investment_Euros = c(1000000, 750000, 500000, 500000, 250000),
  Properties = c(3, 5, 2, 1, 1),
  Expected_ROI_Percentage = c(0.0723, 0.0658, 0.0455, 0.0380, 0.0455),
  Payback_Period_Years = c(13.82, 15.19, 21.93, 26.28, 21.93)
)

risky_scenario %>%
  gt() %>%
  tab_header(
    title = "Risky Investment Scenario"
  ) %>%
  fmt_currency(columns = c(Investment_Euros), currency = "EUR") %>%
  fmt_percent(columns = c(Expected_ROI_Percentage), scale_values = TRUE) %>%
  tab_options(
    table.font.size = "medium",
    heading.title.font.size = "large",
    heading.align = "left"
  )
Risky Investment Scenario
Neighborhood Investment_Euros Properties Expected_ROI_Percentage Payback_Period_Years
Sants-Montjuïc €1,000,000.00 3 7.23% 13.82
Nou Barris €750,000.00 5 6.58% 15.19
Sant Andreu €500,000.00 2 4.55% 21.93
Eixample €500,000.00 1 3.80% 26.28
Ciutat Vella €250,000.00 1 4.55% 21.93

Why This Approach?

•   Higher risk but higher potential upside due to the diverse mix of high ROI areas (Sants-Montjuïc, Nou Barris) and long-term appreciation neighborhoods (Eixample, Ciutat Vella).
•   Spreads risk across different areas, meaning if some neighborhoods underperform, others may compensate.
•   Includes Eixample & Ciutat Vella, which have higher property prices but strong Airbnb demand, leading to potential future price appreciation.

2. Less Risky Investment Scenario (Focused on Stable Returns)

This approach prioritizes stability and predictable cash flows, concentrating the investment in high ROI, low-risk neighborhoods.

Proposed Portfolio:

less_risky_scenario <- data.frame(
  Neighborhood = c("Sants-Montjuïc", "Nou Barris"),
  Investment_Euros = c(1500000, 1500000),
  Properties = c(5, 7),
  Expected_ROI_Percentage = c(0.0723, 0.0658),
  Payback_Period_Years = c(13.82, 15.19)
)

less_risky_scenario %>%
  gt() %>%
  tab_header(
    title = "Less Risky Investment Scenario"
  ) %>%
  fmt_currency(columns = c(Investment_Euros), currency = "EUR") %>%
  fmt_percent(columns = c(Expected_ROI_Percentage), scale_values = TRUE) %>%
  tab_options(
    table.font.size = "medium",
    heading.title.font.size = "large",
    heading.align = "left"
  )
Less Risky Investment Scenario
Neighborhood Investment_Euros Properties Expected_ROI_Percentage Payback_Period_Years
Sants-Montjuïc €1,500,000.00 5 7.23% 13.82
Nou Barris €1,500,000.00 7 6.58% 15.19

Why This Approach?

•   Lower risk, faster payback period (~14-15 years on average).
•   Sants-Montjuïc and Nou Barris have the best combination of high ROI and short payback periods.
•   Avoids high-cost, low-ROI neighborhoods (Sarria-Sant Gervasi, Les Corts, Gracia).
•   More predictable and stable returns from Airbnb rentals.
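
To make the trade-off concrete, the two portfolios can be compared on capital-weighted cash ROI (figures taken from the scenario tables above; appreciation potential, which motivates the risky mix, is not captured here):

```r
# Capital-weighted cash ROI of each portfolio, from the scenario tables
risky_inv <- c(1000000, 750000, 500000, 500000, 250000)
risky_roi <- c(0.0723, 0.0658, 0.0455, 0.0380, 0.0455)

less_risky_inv <- c(1500000, 1500000)
less_risky_roi <- c(0.0723, 0.0658)

blended <- function(inv, roi) sum(inv * roi) / sum(inv)

blended(risky_inv, risky_roi)            # ~5.8% blended cash yield
blended(less_risky_inv, less_risky_roi)  # ~6.9% blended cash yield
```

On cash yield alone the focused portfolio comes out ahead; the case for the diversified mix rests on spreading market risk and on expected appreciation in Eixample and Ciutat Vella.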

Final Conclusion:

Analyzing the options, we think the risky scenario is the best approach because it offers strong potential returns. Diversifying the investment across multiple neighborhoods also reduces the risk of relying too heavily on a single market. By spreading our €3M budget across high-growth areas, we maximize exposure to different demand dynamics while maintaining a good ROI.

Assuming we are young and rich (with a hypothetical €50M net worth), we can afford to take more calculated risks in pursuit of strategic, high-reward returns. We believe this decision positions us for long-term dominance in the Barcelona Airbnb and real estate market.